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Abstract 

We study a set of linear transformations on the Fourier 
series representation of a sequence that can be used as 
the basis for similarity queries on time-series data. We 
show that our set of transformations is rich enough to 
formulate operations such as moving average and time 
warping. We present a query processing algorithm that 
uses the underlying R-tree index of a multidimensional 
data set to answer similarity queries efficiently. Our 
experiments show that the performance of this algo- 
rithm is competitive to that of processing ordinary (ex- 
act match) queries using the index, and much faster 
than sequential scanning. We relate our transforma- 
tions to the general framework for similarity queries of 
Jagadish et al. 

1 Introduction 

Time-series data are of growing importance in many 
new database applications, such as data mining or ware- 
housing. A time series is a sequence of real numbers, 
each number representing a value at a time point. For 
example, the sequence could represent stock or com- 
modity prices, sales, exchange rates, weather data, bio- 
medical measurements, etc. We are often inte r ested in 
similarity queries on time-series data [ APWZ95, ALSS95] 
For example, we may want to find stocks that behave 
in approximately the same way (or approximately the 
opposite way, for hedging); or stocks that increased lin- 
early up to October 1987, and then crashed; or years 
when the temperature patterns in two regions of the 
world were similar. In this type of queries, approximate 
rather than exact matching is required. 

A naive approach is to compute the Euclidean dis- 
tance (or any other distance, such as the city-block dis- 
tance) between two objects (in general) or two time 
sequences (in particular), and call two sequences sim- 
ilar if their distance is less than a user-defined thresh- 



old. Time sequences are usually long, so the distance 
computation can be time consuming. A solution is to 
map time sequences into the frequency domain using the 
Fourier transform, and use the first few coefficients to 
filter out non-similar data. This has the advantage that 
spatial indexing techniques can be used to index time 
sequences by viewing them as tuples of Fourier series 
coefficie nts, that is, points in a low-dimensional space 
[ A.FS93j|FRM9l . 

A problem with this approach is that the user has no 
control over the meaning of similarity other than pro- 
viding a threshold. There are many similarity queries 
that such a fixed predefined notion of similarity fails to 
capture; for example, one may consider two stocks simi- 
lar if they have almost the same price fluctuations, even 
though one stock might sell twice as much as the other. 
Consider the following motivating examples: 




Figure 1: (a) Time sequence si = (36,38,40,38,42,38,36, 
36,37,38,39,38,40,38,37), (b) time sequence sa = 
(40,37,37,42,41,35,40,35,34,42,38,35,45,36,34), (c) the 3- 
day moving average of si, and (d) the 3-day moving 
average of si 
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Example 1.1 Suppose si = (36,38,40,38,42,38,36,36, 
37,38,39,38,40,38,37) and si = (40,37,37,42,41,35,40,35, 
34,42,38,35,45,36,34) are two time sequences that cor- 
respond to the closing prices of two stocks. Looking at 
Figure |l|(a),(b), the sequences do not appear very sim- 
ilar. This is justified by the high Euclidean distance 
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D(sl,S2) = 11.92 between them. However, if we look 
at the three-day moving averages of the two sequences 
(Figure hi (c),(d)), they do look quite similar. The Eu- 
clidean distance between the three-day moving averages 
of two sequences is 0.47. 

Moving averages ar e wide ly used in stock data anal- 
ysis (for example, see [EM69]). Their primary use is to 
smooth out short term fluctuations and depict the un- 
derlying trend of a stock. The computation is simple; 
the i-day moving average of a sequence s = (vi, ■ ■ ■ ,v n ) 
is computed as follows: the mean is computed for an 
i-day-wide window placed over the end of the sequence; 
this will give the moving average for day n — \ l/2\; the 
subsequent values are obtained by stepping the window 
through the beginning of the sequence, one day at a 
time. This will produce a moving average of length 
n — I + 1. We use a slightly different version of moving 
average which is easier to compute in our framework. 
We circulate the window to the end of the sequence 
when it reaches the beginning. This gives us a moving 
average of length n. It turns out when the length of 
the window is small enough compared to the length of 
the sequence, which is usually the case in practice, both 
averages are almost the same. 



that an index structure for moving average can be con- 
structed on the fly from the existing index, and it can 
be used to speed up the query processing. We show 
that this approach not only is faster than sequential 
scanning, but also introduces no extra disk overhead. 
To the best of our knowledge, this is the first index- 
ing method that can handle moving average and time 
warping in the context of similarity queries. 

The organization of the rest of the paper is as fol- 
lows: The rest of the current section provides some basic 
material on the discrete Fourier transform and a survey 
of related work. Section ^ motivates the work by dis- 
cussing possible applications to stock data analysis. Our 
definition of similarity queries is discussed in Section ^. 
In section q we develop an indexing method for these 
similarity queries. Section ^ presents experimental per- 
formance results. We conclude in Section |^. 

1.1 The Discrete Fourier Transform 

In this section, we briefly review the Discrete Fourier 
Transform (DFT) and its properties. Let a time se- 
quence be a finite duration signal x = [x t ] for t = 
0, 1, • • • , n — 1. The DFT of x, denoted by X, is given 
by 



time 



time 
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Figure 

2: (a)Time sequence s = (20,20,21,21,20,20,23,23) 
(b)time sequence p = (20, 21, 20, 23) 



Example 1.2 Consider two time sequences s= (20,21 
,21,21,20,21,23,23) andp = (20,21,20,23) that are sam- 
pled with different frequencies (Figure pi). For example, 
s could be the closing price of a stock taken every day, 
and p could be the closing price of another stock taken 
every other day. A typical query is "is p similar to s 
?" . The sequence s is twice as long as p, so they cannot 
be compared directly. The Euclidean distance between 
p and any subsequence of length four of s is more than 
1.41. If the time dimension of p is scaled by 2, i.e., every 
value u Vi" is replaced by u Vi,Vi" , the resulting sequence 
will be identical to s. This op eration is usually called 
time warping (for example, see |SK83|). 



We propose a class of transformations that can be 
used in a query language to express similarity in a fairly 
general way, handling cases lik e the two examples above. 
Given an R-tree index [Gut84] constructed on a data set, 
we describe a fast query processing algorithm that uses 
the index to filter out unrelated data from the answer 
set of a similarity query. For example, we demonstrate 



n-i 

X / = - 7 =>Jite— s— / = 0, 1, 
v t=o 



• , n — 1 



(1) 



where j = %/— 1 is the imaginary unit. Throughout this 
paper, unless it is stated otherwise, we use small letters 
for sequences in the time domain and capital letters for 
sequences in the frequency domain. The inverse Fourier 
transform of X gives the original signal, i.e., 



1 \ -V „ 327rt/ 

x t = — > X f e - i = 0,l, 



~~/= 7 -V 



, n — 1 



(2) 
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1 / y/n in the front of both Equations hi and 
of signal x is given by the expression 



E(x) = K 



he energy 



(3) 



The convolution of two signals x and y is given by 

n-l 

Conv(x, y)i = 2_, x kVi-k i = 0, 1, • • • , n - 1 

fc=0 



(4) 



where i — /c is computed modulo n. This convolution is 
usually called circular convolution. Equations ^ and |i| 
are unchanged in the frequency domain. 

The following properties of DFT can be foun d in an y 
signal processing textbook (for example, see [ OS75[ ). 
The symbol <S> denotes a DFT pair. 



Linearity 

ax + by <4> aX + bY 
for arbitrary constants a and b, 



(5) 
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Convolution-Multiplication 

conv(x, y) <4> X * Y 



(6) 



where X * Y is the element-to-element vector mul- 
tiplication of two vectors X and Y , and 



Parseval's Relation 



(7) 



Using Parseval's relation, it is easy to show that the Eu- 
clidean distance between two signals in the time domain 
is the same as their distance in the frequency domain. 



D(x, y) = (E(x - y)) 1/2 = (E(X - Y))^ = D(X, Y) 

(8) 

A nice property of the DFT is that for a large family 
of sequences it concentrates the energy in the first few 
coefficients. Thus using the first few coefficients for in- 
dexing introduces few false hits, and no false dismissals. 

1.2 Related Work 

There has been some work on access methods for sim- 
ilarity queries. For example, Agrawal et al. [ AFS93[ 
propose an efficient index structure to retrieve time se- 
quences similar to a given one. They map time se- 
quences into the frequency domain using the Fourier 
transform and keep the first few coefficients in the in- 
dex. Two sequences are considered similar if their Eu- 
clidean distance is less than a user-defined threshold. 
One difficulty with this approach is that the user has 
no control over t he meani ng of similarity. 

Jagadish et al. |jMM95| ] develop a domain-independent 
framework to pose similarity queries on a database. The 
framework has three components: a pattern language P, 
a transformation rule language T, and a query language 
L. An expression in P specifies a set of data objects. An 
object A is considered similar to an object B, if B can be 
reduced to it by a sequence of transformations defined 
in T. The query language proposed in the paper is an 
extension of relational calculus with predicates that test 
whether an object A can be transformed into a member 
of the set of objects described by expression e using the 
transformation t, at a cost bounded by c. A special- 
ization of this work to real-valued sequences where the 
search is performed over sequence sig natures ins tead of 
the original sequences is described in [FJMM95]. 

In this paper, we des cribe an efficient implementa- 
tion of a special case of JMM95] for time-series data. 
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We only study the trivial pattern language where a pat- 
tern expression specifies either a given constant data ob- 
ject, or every object in the database. Given an object o, 
a pattern expression e that denotes a set of objects, and 
a transformation t, the expression t(e)Q denotes the set 
of all objects that can be obtained by applying t to every 
member of the set defined by e. We consider three kinds 
of queries: range queries, all-pair queries, and nearest 
neighbor queries, and we allow our transformations to 
be used in those queries. 

We show how to use the indexing method in ]AFS9| 
to test for similarity under a general class of transfor- 
mations. 



Our work generalizes Goldin et al. [GK95 where 
transformations are limited to shifts and positive scales. 
Our extension allows shifts and scales in every dimen- 
sion of a multidimensional feature space, as well as more 
complex transformations such as moving average. In ad- 
dition, we drop the restriction to positive scales. Some 
advantages of these extensions are shown within the 
next section. 

There are ot her relate d works on time series data. 
Agrawal et al. [APWZ95] describe a pattern language 



called SDL to encode queries about "shapes" found in 
time series. The language allows a kind of blurry match- 
ing where the user specifies the overall shape instead of 
the specific details, but it does not support any oper- 
ations or transformations on the time series. A query 
language for time ser ies dat a in the stock market do- 
main is describe d in Jfiot93[ . The language is built on 
top of CORAL JrS9§7 and every query is translated 
into a sequence of CORAL rules. 

2 Examples from Stock Data Analysis 

In this section we demonstrate how our transformations 
can be used to eliminate noise or short-term fluctua- 
tions and shift or scale the data before computing Eu- 
clidean distances. We use three examples from real 
stock data. The data was obtained from the FTP site 
"ftp.ai.mit.edu /pub / stocks /results/" . 

Example 2.1 Figure shows the daily closing price for 
The Bombay Co. (BBA) starting from October 25th, 
1994 for 128 days, and that for Zweig Total Return Fund 
Inc. (ZTR) starting from July 20th, 1995 for 128 days. 
The Euclidean distance between two series is 16.16. The 
mean for BBA is 9.51, and the mean for ZTR is 8.64. 
If we shift the mean of both series to zero, i.e, subtract 
the mean of each series from everyday closing price, the 
Euclidean distance reduces to 12.78. The closing price of 
ZTR fluctuates in a smaller range than that of BBA; the 
standard deviation for ZTR is 0.10 while the standard 
deviation for BBA is 1.18. We scale both series by the 
inv erse of their standard deviation. The resulting series 
m jCK95| ] are called normal forms of the original series. 
Thus given any sequence s, sequence s' is the normal 
form of s if 



- mean(s) 
std(s) 



for i = 1, 



, length(s) (9) 



1 This is called e » t in [JMM9 



Figure H shows that the Euclidean distance between the 
normal forms of two series is still 11.10; time series ZTR 
is more volatile than BBA. To smooth out short term 
fluctuations, we take the 20-day moving average of the 
two series. The Euclidean distance drops to 2.75. 



Example 2.2 This example shows how we can iden- 
tify series that have opposite price movements. Fig- 
ure ^ shows the daily closing price for Circuit City Stores 
Inc. (CC) (marked by dotted lines) and the daily clos- 
ing price for Varian Associates Inc. (VAR) (marked by 
solid lines) both starting from August 30, 1993 for 128 
days. As Figure ^ shows, the two series have a reverse 
movement; when the price for CC goes up, the price for 
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Figure 3: From left to right, top to bottom: the daily closing price for The Bombay Co. (BBA) starting from 
"94/10/25" for 128 days, the daily closing price for Zweig Total Return Fund Inc. (ZTR) starting from "95/07/20" 
for 128 days, the two stocks put together, both shifted, both scaled, and the 20-day moving average (D denotes the 
Euclidean distance) 



original series: D=1 19.59 normal form: D=21 .81 
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Figure 4: From left to right, top to bottom: the daily closing prices for Circuit City Stores Inc. (CC) (marked by 
dotted lines) and Varian Associates Inc. (VAR) (marked by solid lines) both starting from "93/08/30" for 128 days, 
their normal forms, VAR reversed, and the 20-day moving averages of both series (D denotes the Euclidean distance) 
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VAR goes down and vice versa. The Euclidean distance 
between two series is 119.59. We transform both series 
to their normal form, and the Euclidean distance be- 
comes 21.81. If we reverse the time series of VAR, i.e., 
multiply everyday closing price by -1, and then take 
the 20-day moving average of both series, the Euclidean 
distance will be 3.81. 

One might think that applying these transforma- 
tions, any two series can be made similar. The next 
example shows this is not the case. Given three op- 
erations: shift, scale, and 20-day moving average, we 
can use shift and scale to transform two series to their 
normal forms. We can smooth the normal form series 
using 20-day moving average. Each one of these oper- 
ations may reduce the distance between two series, but 
two series that have dissimilar trends still look different. 
It is obvious that if we keep taking the moving aver- 
age, two series eventually will be the same, i.e two fla t 
straight lines. However, we assume, following PMM95| , 
that each operation has a cost, and we are limited by 
an upper bound on the total cost. This upper bound, 
for example, could be proportional to the Euclidean dis- 
tance between the two original series. 

Example 2.3 Figure [| shows the daily closing prices 
of Digital Microwave Corp. (DMIC) and The Mexico 
Fund Inc. (MXF) both starting from August 30,1993 for 
128 days. The Euclidean distance between the normal 
forms of two series is 11.06. The Euclidean distance 
after taking the 20-day moving average becomes 10.09. 
The Euclidean distances after taking the second and the 
third 20-day moving average are respectively 9.63 and 
9.22. The Euclidean distance even after taking the 10th 
moving average is still 6.57. 

In the next section we encode the transformations 
discussed here in a query language. 

3 Similarity Queries 

We consider an object to be a point in a multidimen- 
sional space (md-space). For non-point objects, we as- 
sume there is a mapping function that maps every ob- 
ject to a point in the md-space. Such a function is 
developed in many domains where multidimensional in- 
dexing has bee n used. For example, Fourier transform 
for time-series JAFS94 and minimum bounding rectan- 
gle for shapes [ Jag91 [are some instances of the mapping 
function. 

A transformation in an n-dimensional space, denoted 
by (a,b), is a pair of n X 1 vectors where a specifies a 
stretch and b represents a translation (Figure ^). The 
transformation (a, b) applied to a point x in some space, 
maps x to a * x + b which is a point in the same space. 
Transformations may be associated with costs. Given a 
set of transformations t, and the cost of applying each 
transformation, a measure of distance (dissimilarity) be- 



(a) a translation 



(b) a stretch 

Figure 6: 

tween two objects can be defined as follows: 
Do(x,y) 

min Tet (cost(T) + D{T(x),y)) 
D(x, y) = min ^ minT£t(cost(T) + D(x, T(y))) 
rninTx T 2 et(cost(Ti) + cost(T2) 
+D(T l (x),T 2 (y))) 

(10) 

where Do(x, y) is the Euclidean distance between x and 
y. We use T(x) to denote "transformation T applied 
to a point x" and T(r) to denote "transformation T 
applied to a relation r". The former returns a point 
while the latter returns a relation. We assume relations 
are unary, that is, they are simply sets of sequences; in 
practice of course they may have other attributes, such 
as source of the data, time period covered, etc. 

In the domain of time series data, both objects and 
transformations can be vectors of complex numbers, so 
we need to extend transformations to complex numbers. 
On the other hand, we want to make sure that this ex- 
tension still keeps the main properties we are interested 
in. The following definition describes these properties. 

Definition 1 A transformation T in a multidimensional 
space S is safe if T maps every rectangle R in S to a 
rectangle R' in S, every point inside R to a point inside 
R' , and every point outside R to a point outside R' . 



Theorem 1 Transformation T — (a,b) is safe if a and 
b are chosen to be vectors of real numbers. 

Proof (sketch): Transformation T here is the compo- 
sition of a stretch and a translation in every dimension. 
Thus T is safe, g 

In the next section, we study the safety condition for 
complex numbers in more detail. 
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Figure 5: From left to right, top to bottom: the daily closing price of Digital Microwave Corp. (DMIC) starting from 
"93/08/30" for 128 days, the daily closing price of The Mexico Fund Inc. (MXF) for the same period, their normal 
forms, and the 20-day moving averages of both series (D denotes the Euclidean distance) 



3.1 Transformations on Time Series 

We consider a time series to be a point in a multidi- 
mensional feature space. We have chosen the first k 
Fourier coefficient of a time series as our features. The 
reason for choosing DFT is mainly because: (a) DFT 
concentrates the energy in the first few coefficients, so 
those coefficients can make a k ey f or indexing purposes; 

it is known that the 
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(b) as we remarked in Section 
Euclidean distance is unchanged under the DFT. Since 
a Fourier coefficient, in general, is a complex number, 
we need a representation for complex numbers in our 
feature space. One possibility is to decompose a com- 
plex number into its real and imaginary components, 
and map each component to a dimension. For a given 
set of k features, we represent it with a point in a 2k- 
dimcnsional space. We denote the space built using this 
representation by Srect- An alternative representation is 
to decompose a complex number into its components in 
the polar coordinate system. A complex number in the 
polar coordinate system is represented by a magnitude 
and a phase angle. We denote the space built using this 
representation by S po i. We use Re(x), Im(x), Abs(x), 
and Angle(x) to denote respectively the real, the imag- 
inary, the magnitude, and the phase angle of a complex 
number x. Now we need to show that the transforma- 
tions described for time series data in previous sections 
are safe. 

Theorem 2 Let a be a vector of real numbers, and b be 
a vector of complex numbers; the transformation T = 
(a,b) is safe with respect to Srect- 

Proof: We need to prove that T maps every rectangle 
R in the space to a rectangle R' , all interior points of 
R to interior points of R' , and all exterior points of 
R to exterior points of R' . Without loss of generality, 
we assume dimensions 2i — 1 and 2i (for i — 1, • • • , k) 
are respectively used for real and imaginary components 
of feature i. Suppose fc is a fc-dimensional vector of 
complex numbers and x, a 2fc-dimensional vector, is its 
representation in Srect- We have xci = X2i-i + X2ij for 
i = 1, • • • , k. If we apply transformation T to xc, we get 



xcl — T(xk) = d*xb+b. We can rewrite this as follows: 

xc[ = a l * (xzi-i + x 2 ij) + {Re(bi) + Im(bi)j) 
= (a« * X2i-i + Re(bi)) + (a< * x 2 i + Im(bi))j 

for i = 1, • • • , k. If we map the resulting vector to a point 
x in Srect, we get a^i-i = ai*x2i-i + Re(pi) and x' 2i = 
ai * X2i + Im(bi) for i = 1, • • • , k. This transformation 
can be rewritten as T' = (c, d) where C2i-\ = C2i = fli, 
d2i-\ = Re(bi), and d2i = Im(bi) for i = 1, • • • , k. Since 
cand d are vectors of real numbers, the rest follows from 
Theorem g 

On the other hand, Theorem |^ does not hold if a 
is chosen to be a vector of complex numbers. For ex- 
ample consider a two dimensional rectangle with point 
p = —5 — 5j as its lower left corner and point q — 5 + 5j 
as its upper right corner, and r = — 2 + 2j as a point 
inside the rectangle. If we multiply the complex num- 
bers representing the three points by s = 2 — 3j, the 
transformed rectangle built on points p * s = —25 + 5j 
and q * s = 25 — 5j does not have point r * s = 2 + lOj 
inside! 

Theorem 3 Let a be a vector of complex numbers, and 
b be a zero vector (b = 0); the transformation T = (o, 6) 
is safe with respect to S po i . 

Proof: Without loss of generality, we assume dimen- 
sions 2i — l and 2i (for i = 1, ■ • ■ , k) are respectively used 
for magnitude and phase angle of feature i. Suppose xc 
is a fc-dimensional vector of complex numbers and x is 
its coordinate in S po i- We have xci = X2i-ie X2i3 for 
1, • • • , k. If we apply transformation T to xc, we get 
= T(xb) = a*xb + b. We can rewrite this as follows: 



l = 
xcf 



XCi 



Abs(a, 



)e " y %,J * X2i 



■ie 



+ 



= (Abs(a t ) * x 2 i-i)e (X2i+Ansleiai » :i 



for i — 1, ■ ■ ■ , k. If we map the resulting vector to a point 
x in Spot, we get x' 2 i-i = Abs(a,i) * x-a-i and x' 2i = 
X21 + Angle(ai) for i — 1, • • • , k. This transformation 
can be rewritten as T' = (c,d) where C2i-\ = Abs(ai), 
d,2i-i — 0, C2i = 1, and d2i = Angle{a,i)iox i = 1, • • • , k. 
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6 — asin( £ / m) 



Figure 7: Minimum bounding rectangle in the polar 
coordinate system 



Since c and d are vectors of real numbers, the rest follows 
from Theorem [[]. g 

Given a query point q in a 2fc dimensional space 
and a threshold e, we need to build a search rectangle. 
A search rectangle is the minimum bounding rectangle 
that contains all points within the Euclidean distance 
e from q. It is straightforward in the rectangular co- 
ordinate system; the minimum bounding rectangle is 
(qi — e,qi + e) for i = 1, • ■ • , 2k. The minimum bounding 
rectangle for a complex number me a3 in the polar coor- 
dinate system is demonstrated in Figure [?]. The magni- 
tude is in the range from m-etom + e, and the angle is 
in the range from a-asin(-) to a+asin( — ) where asin 
denotes the arc sinus of an angle. If we again assume di- 
mensions 2i — l and 2i (for i — 1, • • • , k) are respectively 
used for magnitude and phase angle of feature i, then 
the minimum bounding rectangle for q in the polar coor- 
dinate will be (qt — e, qt+e) for i = 1, 3, 5, • • • , 2k — 1 and 
(qi — asin(—^—), qi + asin(—^—)) for i = 2,4, 6, •• • , 2k. 



3.2 Using Transformations to Express Similarities 

To gain some insight into the transformations, we for- 



malize the notion of similarity expressed in Example L 
Time series si is considered similar to si because th 
3-day moving averages look the same, so we need to 
formulate the 3-day moving average in our transfor- 
mation language. For simplicity, in our examples we 
assign a cost of zero to all transformations. Let us 
denote the Fourier transform of si by Si, the Fourier 
transform of si by 5*2, and the Fourier transform of 
™3 = (§,§, |,0,0,0,0, 0,0, 0,0,0,0,0,0) by M 3 . Now 

mavg-j, = (A? 3 ,0) where 



^3' 3' 3 

consider the transformation T, 
is a zero vector of the same size as Mz 
transformation T mavg z to Si, i.e., 



If we apply the 



T ma „ s3 (Si) = Si * Ms + = Si * M 3 

we get the 3-day moving average of si in the frequency 
domain. If we transform the right hand side back to the 
time domain using the convolution-multiplication rela- 
tion (Equation ^), we get T ma „ 9 3(s!) = conii(sl,m3) 
which is the 3-day moving average of si in the time do- 



main. The same transformation can be applied to si to 
compute its 3-day moving average. 

The notion of similarity can be expressed in a query 
by the proper choice of transformations. For example, 
the m-day moving average of a series of length n gener- 
ally can be expressed by T ma v g = (a, 0) where 

a — (wi — 1/m, W2 = 1/m, • ■ ■ , w m = 1/m, 0, 0, ■ ■ ■ , 0) 



(11) 

and is a zero vector of size n. Transformation T mavg 
may be applied several times to get successive moving 
averages. The weights u>i, - ■ ■ ,w m are not necessarily 
equal. For trend prediction purposes, for example, the 
weights at the end are usually chosen to be higher than 
those at the beginning. Whereas for normal smoothing 
purposes, weights are equal, or those at the center are 
larger than those at the endpoints. 

To give another example of the transformations, we 
formulate the transformation used to reverse a time se- 
ries in Example UJA. Let T rev = (a, 0) where a% = — 1 
for i = 1, ■ • ■ , 128 and is a zero vector of size 128. 
Now consider a time series s and its Fourier transform 
S. Transformation T rev applied to S gives 

T rev (S) =a*S + = -S. 

If we transform both sides into the time domain using 
Equation ^, we get T rev (s) — —s. That is, the daily 
closing price is multiplied by —1. 

Transformation T rev can be used to obtain all the 
pairs of series that move in opposite directions. This 
can be formulated in our query language for a given 
relation r as a spatial join between r and T rev (r). 

Transformations can a lso be defined to stretch the 
time dimension (Example |l.2[ ) . Details of this transfor- 
mation are given in Appendix |aJ In the next section, 
we discuss an indexing technique for similarity queries. 

4 Indexing of Similarity Queries 

In this section we describe a fast query processing method 
for similarity queries. We assume a multidimensional in- 
dex is available, and we take an advantage of that in our 
query processing. Because of the dominant use of the 
R-tree family in multidimensional indexing, we describe 
our approach for R-tree indexes. 

Given an R-tree index / in a multidimensional space 

5 over a data set D, and any safe transformation T in 
S, we give an algorithm to construct an R-tree index I' 
for T(D). 



Algorithm 1 : For every node n 



n = ((MBRi, pointer i), ■ • 
in /, we construct a node n' 

n = T(n) 

= ((MBR[, pointer'^). 



(MBRm, pointer m )) 



, (MBRm, pointer' m )) 



in I' such that MBR^ = T(MBRi), and pointer^ is a 
pointer to T(rii) where rii is the child node (or the data 
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tuple when n is a leaf node) pointed by pointeri (for 
i = 1, ■ ■ • , m). We assumed T is a safe transformation, 
therefore MBE! t is a bounding rectangle for all rectan- 
gles of the child node (or the data tuple) T(n;). The 
construction stops when every node in I is mapped to 
a node in /'. 

There are many possible R-tree indexes on T(D), 
each with a different performance. Our experiments 
show that the index we build here has a similar perfor- 
mance to that of the original index. The main observa- 
tion here is that for a given index I and transformation 
T, index I' can be built on the fly without having much 
impact on the performance of the search. This allows 
us to use one index for many transformations. 

An ind exing method for time series data is described 
m ]AFS93t - This method requires a cut-off point for the 
number of Fourier coefficients kept in the index. We 
denote this cut-off point by k and call the index built 
on the first k Fourier coefficients 'A;- index'. 

To demonstrate the query processin g al gorithm, we 



1.1 



throughout 
that the m- 



use a more general form of Example 
this section. We have seen in Section j| 
day moving average of a series can be expressed by 
Tmavg = (a, 0) where a is a vector of complex num- 
bers. Due to Theorem ^, T mavg is safe if we represent 
complex numbers in the polar coordinate. 

Query: Given a pattern expression e, a safe 
transformation T, an object q, and a thresh- 
old e, find all objects o £ T(e) such that the 
Euclidean distance D(o, q) < e. 

If the pattern expression e denotes only one object 
di, we simply apply T to di. Object di is in the an- 
swer set if D(6{,q) < e. Now suppose the expression 
e denotes all objects in the relation. A naive evalu- 
ation requires reading the whole relation, applying T 
to every object, and choosing every object o such that 
D{o, q) < e. This is a costly process. A better approach 
is to use Algorithm [j] to construct a new index for trans- 
formed objects, and this new index can be built on the 
fly during the search operation. The search algorithm 
for the given range query is as follows: 

Algorithm 2 : Given a k- index whose root is N, a 
transformation T, a threshold e, and a search point q, 
apply T to all points in the index and find those whose 
distance from q becomes less than e. 

1. Preprocessing: 

(a) Transform T and q into the frequency domain 
if they are in the time domain. Let us denote 
the first k Fourier coefficients of T and q by 
Tk and q\ respectively. 

(b) build a search rectangle q re ct for qu as de- 
scribed in Section 3.1. 



2. Search: 

(a) If N is not a leaf, apply T to every (rectangle) 
entry of N and check if the resulting rectan- 
gle overlaps q re ct- For all overlapping entries, 
call Search on the index whose root node is 
pointed to by the overlapping entry. 



(b) If N is a leaf, apply T to every (point) entry 
of N and check if the resulting point overlaps 
greet. If so, the entry is a candidate. 

3. Postprocessing: 

(a) For every candidate point, check its full database 
record to determine if its Euclidean distance 
is at most e from <f. If so, the entry is in the 
answer set. 

Similarly all-pairs queries and nearest neighbor queries 
can be processed efficiently using the index. For an all- 
pairs query, we do a spatial join using the index. The 
only difference here is that we transform all objects used 
in the join predicate before we compute the predicate. 
For example, the join predicate n bj 7^ may be 
changed to T(a;) n T(bj) 7^ where T is a transfor- 
mation and at and bj are members of two spatial sets. 
For a nearest neighbor query, the search starts from the 
root and proceeds down the tree. As we go down the 
tree, we apply T to all entries of the node we visit. We 
can then use any kind of met ric (such as MINDIST or 
MINMAXDIST discussed in |RKV95| ) for pruning the 
search. 

The only thing left to show is that this search scheme 
used with a fc-index misses no object from the answer 
set. 

Lemma 1 The k-mdex approach enhanced with trans- 
formations always returns a superset of the answer set. 



Proof: Suppose we want to find all objects x in a 
relation that are similar to a query object q. Since 
transformations are applied to series in the frequency 
domain, this can be written in the frequency domain as 
follows: 

D(T(X),Q)<e (12) 

where T = (A, B) is a transformation, e is a user-defined 
threshold, and X and Q are DFTs of respectively x and 
q. Applying T to X, we get 

n-l 

D(A * X + B, Q) = (J2 \A f X f + B f - Q f \ 2 )? < e 
f=o 

If we keep only the first k < n coefficients, we have 
fe-l n— 1 

(J2\AfX f + B f -Q f \ 2 )i < (J2\AfX f +B f -Q f \ 2 )i 

(13) 



On the other hand, the equation 



{^AjXf + Bf-Qf?)* <6 (14) 



f=o 



holds for all objects in the answer set. Equations (J13[,|14 
imply that 



(J2 \AfX f + B 



f - Wf\ 



)3 < e 



(15) 



f=o 
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Therefore, keeping the first k coefficients introduces no 
false dismissals. ■ 



This is a generalization of the result of [AFS93] for 
fc-index enhanced with transformations. 



5 Experiments 

We implemented our method on top of Norber t Beck- 
mann's Version 2 implementation of the R*-tree [BKSS90 
We ran experiments on both stock prices data obtained 
from the FTP site "ftp.ai.mit.edu/pub/stocks/results/" 
and synthetic sequences. Each synthetic sequence x = 
[xt] was a random sequence produced as follows: 



= y 

X\ = Xo + Z\ 



Xi 



-1 + z t 



where y was a normally distributed random number in 
the range [20,99], and zt (t = 1,2,- ■■) was a random 
number in the range [—4, 4] . 

We used the polar representation of complex num- 
bers because vector multiplication for time series data 
seemed to be more important than vector addition, and 
due to Theorem ^ vector multiplication is safe with re- 
spect to Spoi- For every time series, we first transformed 
it to the normal form, and then we found its Fourier co- 
efficients. The reason for choosing the normal form was 
because both the average and the standard deviation 
of a series could be stored in the index as two separate 
dimensions, and despite using the polar representation, 
we could still have simple shifts. Since the mean of a 
normal form series is zero by definition, the first Fourier 
coefficient is always zero, so we can throw it away. We 
mapped the mean and the standard deviation of the 
original time series respectively to the first and the sec- 
ond dimensions of the index. We also mapped the mag- 
nitude and the phase angle of the second DFT term 
(computed for the normal form series) respectively to 
the third and the fourth dimensions of the index, and 
the magnitude and the phase angle of the third DFT 
term respectively to the fifth and the sixth dimensions. 



: queries using index with transformations 
: queries using index with no transformations 



(b) a range query that uses no transformations. We var- 
ied the length of the sequences from 64 to 1024 while we 
kept the number of sequences fixed to 1,000. In order to 
have a precise comparison, the identity transformation 
Ti — (I, 0) was chosen such that Ti(o) — o for all objects 
o (I is a vector of l's and is a vector of 0's). This made 
the two queries produce the same results. As Figure |^ 
shows the difference between the two curves is only a 
constant. This constant is the CPU time spent for vec- 
tor multiplication which is unavoidable. The number of 
disk accesses is the same in both cases. 



: queries using index with transformations 
: queries using index with no transformations 
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Figure 9: time per query varying the number of se- 
quences 

In the next experiment, we kept the sequence length 
fixed to 128 while we varied the number of sequences 
from 500 to 12,000. We used the identity transforma- 
tion again for the same reason described in the previous 
experiment. As demonstrated in Figure [], the result 
was the same. Thus the index traversal for similarity 
queries does not deteriorate the performance of the in- 
dex. 
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: queries using index with transformations 
: queries using sequential scanning with transformations 
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Figure 10: time per query varying the sequence length 



200 400 600 800 1000 1200 

Sequence Length 

Figure 8: time per query varying the sequence length 

Figure H compares the execution time for two kinds 
of queries: (a) a range query using transformations and 



Figures |l0| and |ll| compare the execution time of our 
approach to sequential scanning. To have a good imple- 
mentation for the sequential scan, we stop the distance 
computation process as soon as the distance exceeds e. 
In addition, we do the sequential scanning on the rela- 
tion that stores the series in the frequency domain, not 



21 



: queries using index with transformations 

: queries using sequential scanning with transformations 
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Figure 11: time per query varying the number of se- 
quences 



the time domain. Because each series in the frequency 
domain has its larger coefficients at the beginning, the 
distance computation process can skip many sequences 
within the first few coefficients. Both graphs show the 
superiority of our approach. 



— : queries using index with transformations 

— : queries using sequential scanning with transformations 
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Figure 12: time per query varying the size of the answer 
set 

In another experiment, we kept the number of se- 
quences fixed to 1067, and we also kept the sequence 
length fixed to 128. The experiment ran on the real 
stock data. We varied the threshold so that the query 
gave us different numbers of time series in the answer 
set. Figure [12j shows that the index performs better un- 
til the size of the answer set gets larger than 300 which 
is almost one third of the size of the relation. 

Our last experiment was on a spatial self-join. In 
fact, all the time series used as examples in Section |^ 
were the results of this spatial join. We did the test 
using the following methods: 

a scan the relation of Fourier coefficients sequentially, 
and compare every sequence s to all the sequences 
that are after s in the relation; the transformation 
T ma v a 20 is applied to every sequence during the 
comparison; 

b do the sequential scanning as instructed in a, but stop 
the distance computation as soon as the distance 



exceeds e. 

c scan the relation of Fourier coefficients sequentially, 
and for every sequence build a search rectangle 
and pose it to the index as a range query; 

d do the spatial join as described in c, but in every 
index retrieval apply T maV g2o to both the index 
and the search rectangles. 

The experiment ran on a relation of stock prices data 
that had 1067 time sequences, and the length of each 
sequence was 128. The result of the test is shown in 
table [j]. Because of the implementation, b is 10 times 



algorithm 


time 

(mimsec.milisec) 


size of the 
answer set 


a 


20:36.323 


12 


b 


2:31.217 


12 


c 


0:10.139 


3x2 = 6 


d 


0:17.698 


12 x 2 = 24 



Table 1: The result of the join 

faster than a. Methods c and d are 9 to 15 times faster 
than b because of using the index, and d is a bit slower 
than c because of the use of a transformation and the 
size of the answer set. The answer set of d contains 
every pair twice, so it is twice the size of a and b. The 
size of the answer set for c is smaller because it does not 
use the transformation. 

6 Conclusions 

We have proposed a class of transformations that can 
be used in a query language to express similarity among 
objects. This class allows the expression of several prac- 
tically important notions of similarity, and queries using 
this class can be efficiently implemented on top of any 
R-tree index. One potential application which is empha- 
sized in the paper is stock data analysis, but we believe 
other domains can also benefit. Our contributions can 
be summarized as follows: 

• Formulation of moving average, time warping, and 
reversing in our transformation language. 

• Implementation of similarity matching under these 
transformations on top of an R-tree index. 

The experiments show that execution time of our 
method is almost the same as that of accessing the index 
with no transformations; our method has much better 
performance than sequential scanning, and the perfor- 
mance gets better by increasing both the number and 
the length of sequences. 

We think the normal form of [ G!K95|] is a useful repre- 
sentation for time series data, but it allows only a small 
fraction of similarity queries. Our similarity transforma- 
tions allow more general queries, bu t for sim ple shifting 
and scaling, the indexing method in | GK95 1 is faster be- 
cause no transformation needs to be performed on the 
index. Our in dexing technique can be easily built on 
top of |GK95| as we did in our experiments, allowing 
both simple sniffs and scales and more general transfor- 
mations to be applied efficiently. 
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A Time Warping 

Given the first k < n Fourier coefficients of a time series 
s of length n and an integer m > 1, we can construct 
the first k Fourier coefficients of time series s' of length 
my, n using a transformation T = (a, 0) such that 

s mi = s mi + l = ■ • ■ = S m (j + l)_l = Si (f6) 



Let s = (si, S2, ■ ■ ■ , s n ) be a time sequence, and S = 
(Si, S2, ■ • ■ , S n ) be its DFT. Using Equation m, we can 
write 

n-1 

S f = -^y"s t e^ 1L f = 0,l,...,n-l. (17) 

v t=o 

We want to find a vector a such that 

a f *S f = S' f for / = 0,-.-,fc-l. (18) 

where 5^ is the /th Fourier coefficient of S 1 '. Using Equa- 
tion [I], we can write S'f as follows: 



mn — 1 

-j2irtj 

/n 

t=o 



mn — 1 

5 / = ^ E s ' e 



This can be rewritten as 

m — 1 2m— 1 

£ = i=m 

+ 2^ s * e m " )• 

t=(n — l)m 

If we rewrite all summations as Y^tLo ant ^ a ^ so use 
Equation [H| we get 



m — 1 m— 1 

-j2ir(t + m)/ 

+ > sie m " +• 

/n ■ ' 

t=o t=o 



m — 1 m—1 
/ 1 \ -t -j2irt/ X ■> 



m— 1 

-j2ir(t+(n-l)m)/ | 



We can take 5Zt-o e m " ou ^ °^ ^ e P aren thesis; that 
gives 

m — 1 

-j27T(r»-l)/ , 

e " 



b t = — = > e ™" (so+sie « + 
v t=o 

Using Equation we can rewrite this equation as fol- 
lows: 

m — 1 
t=0 

Therefore, if we choose vector a as follows: 

m— 1 

«/ = J]e^ 1 /or / = (),•••, (19) 

t=o 

then the Equation [18 holds. 
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